Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis

Authors

  • Hannah Bast
  • Georges Dupret
  • Debapriyo Majumdar
  • Benjamin Piwowarski
Abstract

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were only known to be useful for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations of a given pair of terms of a given collection: unrelated (car fruit), symmetrically related (car automobile), asymmetrically related with the first term being more specific than the second (banana fruit), and asymmetrically related in the other direction (fruit banana). We give theoretical evidence for the soundness of our criterion, by showing that in a simplified mathematical model the criterion does the apparently right thing. We applied our scheme to the reconstruction of a selected part of the open directory project (ODP) hierarchy, with promising results.
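
The abstract does not state the criterion itself, so the following is only a minimal sketch of the general setting it builds on: term-term similarities obtained from a truncated eigenvector decomposition (PCA) of a term-document matrix, computed at several ranks. The toy matrix, the function name term_similarity_by_rank, and the choice of ranks are assumptions made for illustration; this is not the authors' criterion for classifying term pairs.

```python
import numpy as np


def term_similarity_by_rank(term_doc, ranks):
    """Return {k: term-term similarity matrix built from the top-k eigenvectors}.

    term_doc is a (terms x documents) array; this helper is hypothetical
    and only illustrates rank-k PCA similarities, not the paper's criterion.
    """
    # Center each term's row across documents (the PCA centering step).
    centered = term_doc - term_doc.mean(axis=1, keepdims=True)
    # Term-term covariance matrix; it is symmetric, so eigh applies.
    cov = centered @ centered.T / term_doc.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; sort descending.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sims = {}
    for k in ranks:
        vk = eigvecs[:, :k]
        # Rank-k approximation of the covariance matrix.
        sims[k] = vk @ np.diag(eigvals[:k]) @ vk.T
    return sims


# Toy collection (assumed data): rows are terms, columns are documents.
terms = ["car", "automobile", "banana", "fruit"]
term_doc = np.array([
    [1, 1, 0, 0, 0, 0],  # car
    [1, 1, 0, 0, 0, 0],  # automobile (same documents as car)
    [0, 0, 1, 1, 0, 0],  # banana (subset of fruit's documents)
    [0, 0, 1, 1, 1, 1],  # fruit
], dtype=float)

sims = term_similarity_by_rank(term_doc, ranks=[1, 2, 4])
for k, matrix in sims.items():
    print(f"rank {k}:")
    print(np.round(matrix, 3))
```

Inspecting how the entry for a pair such as (banana, fruit) changes as the rank grows is one way to explore the data; how such values would be turned into the four relation types is exactly what the paper's criterion specifies and is not reproduced here.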

Publication date: 2005